Aditya Negi, Harsh Gupta, Yash Gupta
2023-04-11
In particular:

- Females of L. carteri are “persistent, diurnal biters” and attack the legs of humans as well as the upper body, while females of L. torrens do not bite the legs.
- L. carteri breeds in level terrain, while L. torrens comes from sloping, sometimes steep terrain.
- Males of L. carteri do not form swarms, while males of L. torrens form conspicuous swarms.
- The sex ratio of emerging L. torrens was 1.64 females per male, while it was only 1.33 females per male for L. carteri.
- The females of L. torrens laid 30.4 eggs on average in laboratory conditions, while the females of L. carteri laid only 13.9 eggs on average.
“Based on these biological differences, Wirth & Atchley (1973) resurrected carteri from synonomy but were able to find only a single consistent and objective morphological difference between the 2 taxa, i.e., a microscopic pubescence on the eyes.” That is, the only distinguishing feature between the two species was that L. torrens lacked a fine layer of hair over their eyes!
This was the motivation behind the original 1975 statistical analysis: William Atchley wanted to find “additional distinguishing features between the two species”.
The flies were identified on the basis of morphological differences. Because of storage requirements, only 70 specimens could be used; hence, 35 specimens were chosen from each species.
In the original 1975 study, the techniques used were canonical covariate analysis, stepwise discriminant analysis, multidimensional scaling, and multivariate ANOVA.
Our dataset is a truncated version of the original: of the original 13 variables, we have only 7. Our goals are twofold:

1. To evaluate the validity of the original statistical analysis. In particular, do the assumptions in the original hold? Do we obtain the same results with a smaller subset of the variables?
2. To use techniques which were not used in the original analysis to obtain more insight.
The data consist of 70 rows and 8 columns, as follows:

- Group: 0 for L. torrens, 1 for L. carteri.
- Wing length, shortened to “WL”.
- Wing width, shortened to “WW”.
- Third palp length, shortened to “TPL”. (Palps are organs between the antennae responsible for smell, touch, and taste.)
- Third palp width, shortened to “TPW”.
- Fourth palp length, shortened to “FPL”.
- Length of twelfth antennal segment, shortened to “ASTW”.
- Length of thirteenth antennal segment, shortened to “ASTH”.

Except for Group, which is categorical, all of these variables are numeric.
## Group WL WW TPL TPW FPL ASTW ASTH
## 1 0 85 41 31 13 25 9 8
## 2 0 87 38 32 14 22 13 13
## 3 0 94 44 36 15 27 8 9
## 4 0 92 43 32 17 28 9 9
## 5 0 96 43 35 14 26 10 10
## 6 0 91 44 36 12 24 9 9
## 'data.frame': 70 obs. of 8 variables:
## $ Group: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ WL : int 85 87 94 92 96 91 90 92 91 87 ...
## $ WW : int 41 38 44 43 43 44 42 43 41 38 ...
## $ TPL : int 31 32 36 32 35 36 36 36 36 35 ...
## $ TPW : int 13 14 15 17 14 12 16 17 14 11 ...
## $ FPL : int 25 22 27 28 26 24 26 26 23 24 ...
## $ ASTW : int 9 13 8 9 10 9 9 9 9 9 ...
## $ ASTH : int 8 13 9 9 10 9 9 9 9 10 ...
As a first step, we are interested in comparing the variables across the two groups.
There is some separation between the measures of central tendency for almost all of the variables.
The third palp length and fourth palp length of the two groups are especially distinct.
There is an extreme outlier in wing width. It is quite likely a mismeasurement and a strong candidate for removal.
## Group WL WW TPL TPW FPL ASTW ASTH
## 34 0 90 40 37 12 22 9 10
## 35 0 104 46 37 14 30 10 10
## 36 1 86 19 37 11 25 9 9
## 37 1 94 40 38 14 31 6 7
## 38 1 103 48 39 14 33 10 10
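The outlier is easy to flag programmatically. A minimal sketch, re-entering the wing-width values printed above (rows 34–38) as a stand-in for the full data frame:

```r
# Stand-in: the wing-width values of rows 34-38, as printed above
rows <- data.frame(id = 34:38,
                   Group = c(0, 0, 1, 1, 1),
                   WW = c(40, 46, 19, 40, 48))
# The smallest wing width belongs to observation 36 (WW = 19)
rows$id[which.min(rows$WW)]
```

On the full data, the same `which.min` call over the WW column identifies the observation to drop.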
Many of the variables are “widely spread”. Further, the antennal segment variables exhibit “bumps” because they are discrete in nature. It may be difficult to transform them to normality.
Even before splitting into groups, it is clear from the density plots that most of the variables are far from normal.
Wing length is significantly correlated with every other variable. Wing width is significantly correlated with every variable other than the thirteenth antennal segment length. This may suggest that these two variables are indicators of “general size”.
The palp lengths are significantly correlated with each other; third palp length is not significantly correlated with third palp width, but fourth palp length is.
The antennal segments are, perhaps unsurprisingly, very strongly correlated with each other.
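These significance claims come from pairwise Pearson correlation tests. A self-contained sketch of the kind of test used, on simulated stand-in data (the real analysis would call `cor.test` on columns of `fly`):

```r
set.seed(42)
# Simulated stand-in for a pair of correlated measurements (e.g. WL and WW)
wl <- rnorm(70, mean = 90, sd = 5)
ww <- 0.4 * wl + rnorm(70, sd = 2)
ct <- cor.test(wl, ww)   # Pearson test of zero correlation
ct$estimate
ct$p.value
```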
With seven different variables, it is difficult to directly guess about “separation” between the two groups. Graphing the first few principal components may indicate clear differences and suggest more detailed investigation.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 7.4317 3.5747 3.0654 2.56232 1.50634 1.28018 0.44094
## Proportion of Variance 0.6271 0.1451 0.1067 0.07455 0.02576 0.01861 0.00221
## Cumulative Proportion 0.6271 0.7722 0.8789 0.95342 0.97918 0.99779 1.00000
Clearly, there is some evidence that the groups are “separate”, but plenty of observations from group 1 are mixed in with group 0. Does plotting the third principal component help?
It doesn’t seem that adding a third principal component to the visualisation helps much.
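The component summary and score plots above can be produced along these lines; a self-contained sketch on simulated data of the same shape (the real analysis would use `prcomp(fly[, -1])`):

```r
set.seed(1)
# Simulated stand-in with the same shape as the fly data: 70 rows, 7 measurements
X   <- matrix(rnorm(70 * 7), nrow = 70)
grp <- rep(0:1, each = 35)
pc  <- prcomp(X, center = TRUE)
summary(pc)  # standard deviations and variance proportions, as tabulated above
# Score plot of the first two components, coloured by group
plot(pc$x[, 1], pc$x[, 2], col = grp + 1, xlab = "PC1", ylab = "PC2")
```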
According to the PCA, two linear combinations of the variables explain most of the variation in the data. This suggests that factor analysis might be helpful.
We quickly verify that the data may indeed be usefully factor-analysed using the KMO test and Bartlett’s test for sphericity.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = fly[, -1])
## Overall MSA = 0.69
## MSA for each item =
## WL WW TPL TPW FPL ASTW ASTH
## 0.76 0.77 0.74 0.81 0.76 0.58 0.56
## R was not square, finding R from data
## $chisq
## [1] 171.5042
##
## $p.value
## [1] 1.318656e-25
##
## $df
## [1] 21
First, we perform principal component factor analysis; this extraction method does not require multivariate normality, so we may use it before verifying that assumption. Because two principal components explain most of the variation in the data, we suspect that two factors may be enough.
## Principal Components Analysis
## Call: principal(r = fly[, -1], nfactors = 2, rotate = "none", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## WL 0.84 -0.10 0.72 0.285 1.0
## WW 0.72 -0.22 0.57 0.425 1.2
## TPL 0.54 -0.37 0.43 0.570 1.8
## TPW 0.55 -0.26 0.37 0.634 1.4
## FPL 0.64 -0.45 0.60 0.396 1.8
## ASTW 0.60 0.72 0.88 0.117 1.9
## ASTH 0.58 0.75 0.90 0.099 1.9
##
## PC1 PC2
## SS loadings 2.93 1.54
## Proportion Var 0.42 0.22
## Cumulative Var 0.42 0.64
## Proportion Explained 0.65 0.35
## Cumulative Proportion 0.65 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 28.72 with prob < 0.00035
##
## Fit based upon off diagonal values = 0.93
Next, we attempt a varimax rotation.
## Principal Components Analysis
## Call: principal(r = fly[, -1], nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## WL 0.76 0.37 0.72 0.285 1.5
## WW 0.73 0.21 0.57 0.425 1.2
## TPL 0.66 -0.02 0.43 0.570 1.0
## TPW 0.60 0.08 0.37 0.634 1.0
## FPL 0.78 -0.03 0.60 0.396 1.0
## ASTW 0.11 0.93 0.88 0.117 1.0
## ASTH 0.08 0.95 0.90 0.099 1.0
##
## RC1 RC2
## SS loadings 2.52 1.96
## Proportion Var 0.36 0.28
## Cumulative Var 0.36 0.64
## Proportion Explained 0.56 0.44
## Cumulative Proportion 0.56 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 28.72 with prob < 0.00035
##
## Fit based upon off diagonal values = 0.93
The varimax rotation is easier to interpret. The first factor is heavily loaded on the non-antennal segment data, while the second factor is highly loaded on antennal length data. Hence we can roughly call the first factor the “non-antennal factor” and the second the “antennal factor”.
Factor analyses were attempted for 3 and 4 factors, but they yielded unsatisfactory results.
MANOVA and the classification methods below rest on three assumptions:

1. The observations from each population are independently drawn.
2. Each population is multivariate normal.
3. The populations have a common covariance matrix \(\Sigma\).

The first of these we may take to be true, because the observations come from different specimens. It remains to verify 2) and 3).
Before testing multivariate normality, we first test the univariate normality of each variable within each group.
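The per-group tests below come from `nor.test`; the same check can be sketched in base R with `shapiro.test`, shown here on simulated stand-in data:

```r
set.seed(7)
# Simulated stand-in for one variable measured in two groups of 35;
# group 1 is deliberately skewed so the two tests can disagree
x   <- c(rnorm(35, 90, 5), 80 + 5 * rexp(35))
grp <- rep(c(0, 1), each = 35)
# Shapiro-Wilk p-value within each group
tapply(x, grp, function(v) shapiro.test(v)$p.value)
```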
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 2] and Group
##
## Level Statistic p.value Normality
## 1 0 0.9510978 0.122472063 Not reject
## 2 1 0.8840922 0.001517748 Reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 3] and Group
##
## Level Statistic p.value Normality
## 1 0 0.9444748 7.666707e-02 Not reject
## 2 1 0.6613229 9.837494e-08 Reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 4] and Group
##
## Level Statistic p.value Normality
## 1 0 0.9210176 0.01525362 Reject
## 2 1 0.9663625 0.35136571 Not reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 5] and Group
##
## Level Statistic p.value Normality
## 1 0 0.9328069 0.03392550 Reject
## 2 1 0.9343470 0.03773533 Reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 6] and Group
##
## Level Statistic p.value Normality
## 1 0 0.9641591 0.303597278 Not reject
## 2 1 0.8987116 0.003651696 Reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 7] and Group
##
## Level Statistic p.value Normality
## 1 0 0.796832 1.769595e-05 Reject
## 2 1 0.918061 1.254047e-02 Reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : fly[, 8] and Group
##
## Level Statistic p.value Normality
## 1 0 0.7963882 1.734819e-05 Reject
## 2 1 0.9095155 7.197832e-03 Reject
## --------------------------------------------------
Most of the variables fail the test of univariate normality in at least one group. We must resort to transforming the data to attain approximate normality.
First, we will remove observation 36, which was a prominent outlier with regard to wing width.
Now we apply the Box-Cox transformation to each variable.
library(car)   # bcPower()
# lambda = 2 (MLE)
transformed.flyr$WL <- bcPower(flyr$WL, 2)
nor.test(WL ~ Group, data = transformed.flyr, plot = NULL, alpha = 0.01)
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : WL and Group
##
## Level Statistic p.value Normality
## 1 0 0.9439853 0.074064433 Not reject
## 2 1 0.8935651 0.003114805 Reject
## --------------------------------------------------
transformed.flyr$WW <- bcPower(flyr$WW, 3)
nor.test(WW ~ Group, data = transformed.flyr, plot = NULL, alpha = 0.01)
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : WW and Group
##
## Level Statistic p.value Normality
## 1 0 0.9526625 0.1367946 Not reject
## 2 1 0.9650367 0.3387956 Not reject
## --------------------------------------------------
transformed.flyr$TPL <- bcPower(flyr$TPL, 0.5)
nor.test(TPL ~ Group, data = transformed.flyr, plot = NULL, alpha = 0.01)
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : TPL and Group
##
## Level Statistic p.value Normality
## 1 0 0.9152249 0.01041137 Not reject
## 2 1 0.9675149 0.39644333 Not reject
## --------------------------------------------------
transformed.flyr$TPW <- bcPower(flyr$TPW, 0)   # lambda = 0 is the log transform
nor.test(TPW ~ Group, data = transformed.flyr, plot = NULL, alpha = 0.01)
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : TPW and Group
##
## Level Statistic p.value Normality
## 1 0 0.9454765 0.08228744 Not reject
## 2 1 0.9241424 0.02119192 Not reject
## --------------------------------------------------
For the remaining three variables (FPL, ASTW, ASTH), the maximum likelihood estimate of \(\lambda\) is 1, so the Box-Cox transformation (which reduces to a simple shift at \(\lambda = 1\)) cannot improve normality there.
For FPL, this estimate of \(\lambda\) may be explained by several extreme observations, which suggest that the underlying distribution is heavy-tailed.
For the antennal segment variables, the explanation is that they are strongly discrete, taking very few distinct values. As a remedy, we add jitter (small random noise) to these two variables.
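A minimal sketch of the jitter step, on a simulated stand-in for one of the antennal counts (base R's `jitter()` works similarly):

```r
set.seed(11)
# Stand-in for a discrete antennal count taking few distinct values
astw <- sample(8:13, 70, replace = TRUE)
# "Rounding error" jitter of up to +/- 0.5
astw_jit <- astw + runif(70, min = -0.5, max = 0.5)
c(before = length(unique(astw)), after = length(unique(astw_jit)))
```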
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : ASTW and Group
##
## Level Statistic p.value Normality
## 1 0 0.9046967 0.005301109 Reject
## 2 1 0.9568060 0.195926659 Not reject
## --------------------------------------------------
##
## Shapiro-Wilk Normality Test (alpha = 0.01)
## --------------------------------------------------
## data : ASTH and Group
##
## Level Statistic p.value Normality
## 1 0 0.9465347 0.08867869 Not reject
## 2 1 0.9300098 0.03130258 Not reject
## --------------------------------------------------
Comments
After adding jitter, ASTH passes the univariate normality test in both groups, and ASTW in one of the two, at the 1% level. We may interpret the jitter as representing “rounding error” in the measurement.
We try to detect outliers in our transformed data matrix.
There are a few outliers, but due to the already scanty data, there does not seem much justification to arbitrarily remove them unless they cause trouble later on.
Now that approximate univariate normality is established, we can go ahead and test multivariate normality.
We test this assumption using graphical methods (gamma plots) and hypothesis testing (Henze-Zirkler test of multivariate normality).
## [1] 2 15
## 51 46
## 15 10
For the first group, gamma plots suggest that the data is not multivariate normal.
For the second group, the plots suggest that we may be justified in assuming normality.
Let us see whether formal tests give the same results.
## Henze-Zirkler test for Multivariate Normality
##
## data : tA[, -c(1)]
##
## HZ : 0.9847339
## p-value : 0.02539919
##
## Result : Data are not multivariate normal (sig.level = 0.05)
## Henze-Zirkler test for Multivariate Normality
##
## data : tB[, -c(1)]
##
## HZ : 0.8977875
## p-value : 0.5043688
##
## Result : Data are multivariate normal (sig.level = 0.05)
Comments
The Henze-Zirkler tests (p-values 0.025 and 0.504) show that we are justified in assuming multivariate normality at the 1% level of significance. So we will assume multivariate normality of the transformed data for both groups.
Having addressed multivariate normality, we now test the homogeneity of the covariance matrices of the two groups.
\(H_0 : \Sigma_1=\Sigma_2\) vs \(H_1 : \Sigma_1 \ne \Sigma_2\)
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <dbl> <chr>
## 1 45.6 0.0191 28 Box's M-test for Homogeneity of Covariance Matric…
Comment
Box's M test gives p = 0.0191, so at the 1% level we do not reject \(H_0\): we may treat the covariance matrices as equal.
With the assumptions of MANOVA satisfied, we can proceed with testing the equality of the two mean vectors.
\(H_0 : \mathbf{\mu_1} = \mathbf{\mu_2}\) vs \(H_1 : \mathbf{\mu_1} \ne \mathbf{\mu_2}\)
## Df Pillai approx F num Df den Df Pr(>F)
## Group 1 0.49368 8.4967 7 61 3.189e-07 ***
## Residuals 67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
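The Pillai-trace table above comes from base R's `manova()`; a minimal self-contained sketch on simulated stand-in data:

```r
set.seed(3)
# Two groups of 35 with three response variables; group shifts the first mean
grp <- factor(rep(0:1, each = 35))
Y   <- cbind(rnorm(70, mean = as.integer(grp)), rnorm(70), rnorm(70))
fit <- manova(Y ~ grp)
summary(fit, test = "Pillai")   # Pillai's trace, as in the table above
```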
Comment
Note that for comparing two populations, ANOVA is equivalent to a two-sample t-test (equal variances).
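This equivalence is easy to check numerically; a self-contained sketch on simulated data:

```r
set.seed(5)
x <- rnorm(40)
g <- factor(rep(c("a", "b"), each = 20))
F_stat <- anova(lm(x ~ g))["g", "F value"]
t_stat <- t.test(x ~ g, var.equal = TRUE)$statistic
isTRUE(all.equal(F_stat, unname(t_stat)^2))   # TRUE: the two-group ANOVA F equals t^2
```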
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 1708484 1708484 5.312 0.0243 *
## Residuals 67 21548145 321614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 1.532e+08 153221530 5.777 0.019 *
## Residuals 67 1.777e+09 26521301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 7.376 7.376 42.66 1.03e-08 ***
## Residuals 67 11.584 0.173
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.0066 0.006629 0.503 0.481
## Residuals 67 0.8830 0.013179
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 352.1 352.1 25.91 3.11e-06 ***
## Residuals 67 910.4 13.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.26 0.2647 0.196 0.659
## Residuals 67 90.35 1.3485
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 0.39 0.3939 0.308 0.581
## Residuals 67 85.61 1.2778
Wing length and Wing width are significantly different at \(\alpha = 0.05\).
Third palp length and fourth palp length are significant even at \(\alpha = 0.01\).
The above results exactly match the results of the original 1975 analysis performed by William Atchley!
As all of the above assumptions are satisfied for the given data, we can proceed to test whether the profiles are parallel.
\(H_0:\) Profiles are parallel, i.e. \(\mu_{1i}-\mu_{1,i-1} = \mu_{2i}-\mu_{2,i-1},\ i=2,\ldots,p.\)
\(H_1:\) Profiles are not parallel.
\(\textbf{Decision:}\) Reject \(H_0\) (parallel profiles) at level \(\alpha\) if \[T^2 = (\bar{x}_1-\bar{x}_2)'C' \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) CS_{\text{pooled}}C' \right]^{-1} C (\bar{x}_1-\bar{x}_2) > c^2, \] where \[c^2 = \frac{(n_1 + n_2 -2)(p-1)}{n_1+n_2-p} F_{p-1,\,n_1+n_2-p} (\alpha). \]
# Calculation of the p-value for the parallel-profile test
n1 <- 35
n2 <- 34   # observation 36 was removed from group 1
p  <- 7
# Contrast matrix of successive differences
C <- matrix(c(-1, 1, 0, 0, 0, 0, 0,
              0, -1, 1, 0, 0, 0, 0,
              0, 0, -1, 1, 0, 0, 0,
              0, 0, 0, -1, 1, 0, 0,
              0, 0, 0, 0, -1, 1, 0,
              0, 0, 0, 0, 0, -1, 1),
            nrow = 6, ncol = 7, byrow = TRUE)
# S.tA, S.tB, txbar, tybar: the group covariance matrices and mean vectors
# of the transformed data, computed earlier
S_pooled <- 1 / (n1 + n2 - 2) * ((n1 - 1) * S.tA + (n2 - 1) * S.tB)
# Test statistic
T_square <- t(C %*% (txbar - tybar)) %*%
  solve((1 / n1 + 1 / n2) * C %*% S_pooled %*% t(C)) %*%
  C %*% (txbar - tybar)
# T^2 divided by this scale factor follows an F(p-1, n1+n2-p) distribution
scale_fac <- (n1 + n2 - 2) * (p - 1) / (n1 + n2 - p)
p_value <- pf(T_square / scale_fac, p - 1, n1 + n2 - p, lower.tail = FALSE)
print(p_value)
## [,1]
Comments
The null hypothesis is rejected at \(\alpha = 0.01\); i.e. the profiles are not parallel.
We have assumed that:

- The prior probabilities for the two groups are the same.
- Each group incurs an equal cost of misclassification.

We have already shown that:

- The transformed data are multivariate normal at the 1% level of significance (Henze-Zirkler test).
- The covariance matrices of the two groups are equal at the 1% level of significance (Box's M test).

Since all assumptions are satisfied, we will use LDA on the transformed data in two different ways:

1. By splitting the data into training and testing sets.
2. Using leave-one-out cross-validation.
## Call:
## lda(Group ~ ., data = train.data)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## WL WW TPL TPW FPL ASTW ASTH
## 0 4747.839 27242.30 9.953778 2.663342 25.96429 9.455446 9.432200
## 1 5041.339 29573.69 10.602066 2.693169 30.07143 9.828678 9.827236
##
## Coefficients of linear discriminants:
## LD1
## WL -3.954937e-04
## WW -7.093095e-06
## TPL 2.405805e+00
## TPW -2.598409e+00
## FPL 1.476042e-01
## ASTW 1.820921e-01
## ASTH -1.733004e-01
Comment: On the training data set, the APER (apparent error rate, i.e. misclassification rate) of LDA is about 14%.
Comment: On the testing data set, the misclassification rate of LDA is about 23%.
## [1] 0.2318841
Comments:
Estimate of AER = 16/69 ≈ 0.232
\(\hat{\mathbb{P}}[L. torrens|L. carteri]\) = 10/34 ≈ 0.294
\(\hat{\mathbb{P}}[L. carteri|L. torrens]\) = 6/35 ≈ 0.171
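The leave-one-out run uses `MASS::lda` with `CV = TRUE`. A self-contained sketch on a built-in two-group stand-in (the report itself uses the transformed fly data):

```r
library(MASS)   # lda() and qda(); MASS ships with the standard R distribution
# LOOCV sketch on a two-group stand-in: the first two iris species
d  <- droplevels(iris[1:100, ])              # setosa vs versicolor
cv <- lda(Species ~ ., data = d, CV = TRUE)  # CV = TRUE gives leave-one-out
mean(cv$class != d$Species)                  # LOOCV misclassification rate
```

`qda()` has exactly the same interface, so the quadratic analysis below differs only in the function name.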
For comparison, we may also use quadratic discriminant analysis.
Confusion matrix obtained from cross validation method
## [1] 0.2463768
Estimate of AER = 17/69 ≈ 0.246
\(\hat{\mathbb{P}}[L. torrens|L. carteri]\) = 8/34 ≈ 0.235
\(\hat{\mathbb{P}}[L. carteri|L. torrens]\) = 9/35 = 0.257
Both LDA and QDA have more or less the same misclassification rate. It’s not bad, but can we do any better?
KNN (k-nearest neighbours) is an algorithm for classifying a new data point into predefined classes.
The algorithm begins with the training data set, which consists of labeled data points (also known as instances) in the feature space.
When given a new, unlabeled data point, the algorithm looks for the K-nearest points in the training set to the new point based on a distance metric, such as Euclidean distance or Manhattan distance.
The K-nearest points are identified, and the label of the new point is assigned based on the majority label of those K-nearest points. For example, if K=3 and two of the nearest neighbors have a label of “A” and one has a label of “B”, then the new point would be classified as “A”. (Ties are typically broken at random.)
Before applying KNN, data should be normalized. This ensures that each feature has the same influence on the distance calculation and prevents the KNN algorithm from being biased towards certain features.
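The procedure can be sketched with `class::knn`, shown here self-contained on a built-in two-group stand-in (the report itself uses the transformed fly data):

```r
library(class)   # knn(); part of the standard R distribution
set.seed(9)
d <- droplevels(iris[1:100, ])   # two-group stand-in: setosa vs versicolor
X <- scale(d[, 1:4])             # normalise so every feature weighs equally
idx  <- sample(100, 70)          # 70/30 train/test split
pred <- knn(train = X[idx, ], test = X[-idx, ], cl = d$Species[idx], k = 5)
mean(pred != d$Species[-idx])    # test-set misclassification rate
```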
On the training data set, the misclassification rate of KNN is about 3%.
On testing data set, the misclassification rate of KNN is about 21%.
# Confusion matrices from the cross-validation runs, for k = 1, ..., 10
# confusionMatrix() comes from the caret package
confusion.knn.cross <- list()
for (i in 1:10) {
  confusion.knn.cross[[i]] <- confusionMatrix(
    data = as.factor(as.integer(knn.cross.labels[, i]) - 1),
    reference = as.factor(flyr$Group)
  )
}
knn.aer <- matrix(0, nrow = 10, ncol = 1)
rownames(knn.aer) <- paste0("k=", 1:10)
colnames(knn.aer) <- "Estimate of E[AER]"
for (i in 1:10) knn.aer[i, 1] <- round(1 - confusion.knn.cross[[i]]$overall[1], 3)
# Estimate of AER of KNN for different values of k
knn.aer %>%
  knitr::kable(format = "html")

| | Estimate of E[AER] |
|---|---|
| k=1 | 0.116 |
| k=2 | 0.116 |
| k=3 | 0.130 |
| k=4 | 0.116 |
| k=5 | 0.087 |
| k=6 | 0.130 |
| k=7 | 0.087 |
| k=8 | 0.145 |
| k=9 | 0.145 |
| k=10 | 0.116 |
Comment: For k = 5, the estimated AER of KNN is approximately 0.09, which is better than that of LDA.
Comment
Estimate of AER = 6/69 = 0.087
\(\hat{\mathbb{P}}[L. torrens|L. carteri]\) = 6/34 = 0.176
\(\hat{\mathbb{P}}[L. carteri|L. torrens]\) = 0/35 = 0
Because it is difficult to establish the assumptions of multivariate normality and homogeneous covariances for this dataset, we may approach the classification problem in a different way. Logistic regression offers a method of classifying the observations without requiring too many assumptions on the data.
The key assumptions under binary logistic regression are independence of observations and a binary dependent variable, both of which are satisfied by our data.
Multicollinearity is not really a problem, because we are interested in prediction rather than inference.
We will run logistic regression followed by leave-one-out cross-validation to verify the results.
# Fit on all 70 observations, then compute the apparent error rate (APER)
ma_model <- glm(Group ~ ., family = binomial(link = "logit"), data = fly)
errvec <- rep(0, nrow(fly))
for (i in 1:nrow(fly)) {
  errvec[i] <- (predict(ma_model, fly[i, -1], type = "response") > 0.5)
}
sum(errvec != fly[, 1]) / nrow(fly)
## [1] 0.08571429
# Leave-one-out cross-validation: refit without observation i, then predict it
errvec <- rep(0, nrow(fly))
for (i in 1:nrow(fly)) {
  model <- glm(Group ~ ., family = binomial(link = "logit"), data = fly[-i, ])
  errvec[i] <- (predict(model, fly[i, -1], type = "response") > 0.5)
}
sum(errvec != fly[, 1]) / nrow(fly)
## [1] 0.1428571
Our first goal was to verify the results of the original analysis. This goal has borne rich fruit. Our analysis found significant differences between the mean vectors of the features of the two species. Moreover, significant differences were found in wing length, wing width, third palp length, and fourth palp length—which was exactly what the original 1975 analysis had found!
We performed classification using four different techniques: LDA, QDA, KNN, and binary logistic regression. All of these techniques yielded very satisfactory results.
We also performed a principal components analysis and a PC factor analysis on the data. Two principal components alone explain most of the variation in the data. Moreover, the factor analysis after a varimax rotation yields a very meaningful, interpretable result: the latent factors in the data are the “non-antennal measurements” and the “antennal measurements”.
We are heavily indebted to the original data analysis by William Atchley, which we shall enclose.
For further information on the HZ test of hypothesis, see here.